AUSTRALIA
Patents Act 1990 

Proteome Systems Intellectual Property Pty Ltd 
PROVISIONAL SPECIFICATION 

Invention Title: 

A method for determining the biological likelihood of theoretical
compositions or structures

The invention is described in the following statement:

Field of the Invention

This invention relates to a method of determining the biological likelihood of 
theoretical compositions or structures, particularly glycans. 

Background of the Invention

Glycans (sugar structures/oligosaccharides) are usually composed of varying
numbers of fewer than a dozen biologically occurring monosaccharides. When
considered purely in terms of their masses, there are usually only about 3-6 different
mass-unique monosaccharides in a typical glycan structure. The most frequently
encountered unique-mass monosaccharides are Hex (mass 162 Da; includes all hexose
monosaccharides), HexNAc (mass 203 Da; includes all acetamidohexose
monosaccharides), dHex (mass 146 Da; includes all deoxyhexose monosaccharides),
Pent (mass 132 Da; includes all pentose monosaccharides), and NeuAc (mass 291 Da;
N-acetylneuraminic (sialic) acid). There are several other biologically extant, though
less frequently encountered, component monosaccharides, such as KDN, HexA and
NeuGc. Other non-monosaccharide adducts such as sulfate (S; mass 79.97 Da),
phosphate (P; mass 97.98 Da), methyl (14 Da), and acetyl (42 Da) are also occasionally
observed on biologically occurring oligosaccharides.
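
As a minimal illustration of the mass arithmetic implied by the figures above, the following Python sketch sums the quoted nominal residue masses for a composition. The names `RESIDUE_MASS` and `composition_mass` are hypothetical, chosen only for this example, and the nominal integer masses are used purely for illustration; a practical tool would presumably use more precise masses.

```python
# Nominal residue masses (Da) as quoted in the text above.
RESIDUE_MASS = {
    "Hex": 162.0,
    "HexNAc": 203.0,
    "dHex": 146.0,
    "Pent": 132.0,
    "NeuAc": 291.0,
}


def composition_mass(composition):
    """Sum the residue masses of every component in a composition.

    `composition` maps a component name to the number of times it occurs.
    """
    return sum(RESIDUE_MASS[name] * count for name, count in composition.items())


# Three Hex and two HexNAc: 3*162 + 2*203
print(composition_mass({"Hex": 3, "HexNAc": 2}))  # 892.0
```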

It is often the case during the characterisation of biological molecules that a
precise mass may be ascertained for each biological molecule but its composition and
identity are unknown. Given a reasonably accurate mass, such as would normally be
obtained by mass spectrometry, the monosaccharide composition of an unknown
glycan can be theorised by determining, by computation, the set of monosaccharide
compositions that are within a reasonable mass deviation (or tolerance) of the observed
mass. This approach forms the basis of glycomod
(http://us.expasy.org/tools/glycomod/), a publicly available research tool. The
shortcomings of this tool, and of this purely theoretical approach by extension, are that a
large number of compositions are returned for any mass of larger than moderate size,
and that the majority (90-99%) of these have little in common with known biologically
extant compositions.

The aim of the present invention is to attempt to alleviate some of the above
described problems and to reduce the large number of irrelevant compositions returned 
by existing tools. 

Any discussion of documents, acts, materials, devices, articles or the like which
has been included in the present specification is solely for the purpose of providing a
context for the present invention. It is not to be taken as an admission that any or all of
these matters form part of the prior art base or were common general knowledge in the
field relevant to the present invention as it existed before the priority date of each claim
of this application.

Summary of the Invention

In a first broad aspect, the present invention incorporates statistical measures of 
biological relevance for the theoretical compositions returned. 

Typically, biological relevance, expressed as a numerical score, or biological
index, is determined by statistical comparison to an established reference set of known
and fully characterised compositions, in the case of glycans a reference set such as the
Glycosuite (http://www.glycosuite.com) database. The biological index of any given
composition may then be used as a basis for discarding biologically "unlikely"
compositions, as well as for ranking (sorting) of returned compositions by biological
likeliness.

Empirically, for glycans this allows between 90% and 99.9% of theoretical
compositions returned by any given search to be discarded, whilst preserving and
ranking the remaining, biologically likely compositions.

In one aspect the present invention provides a method of determining the 
likelihood of a theoretical candidate composition comprising the steps of: 
selecting a reference group of known characterised compositions;

establishing statistical characteristics relating to components of or other features
of the known characterised compositions; and

comparing the statistical characteristics of the known characterised
compositions with corresponding components or features in the theoretical candidate
compositions to establish a likelihood of those compositions occurring.

More particularly, for glycans, in one aspect the present invention provides a 
method of characterising glycans comprising the steps of: 

providing a search mass of a glycan whose composition is to be determined;

generating a list of theoretical glycans made up of components, including
monosaccharides, whose total mass is within a predetermined tolerance of the search
mass;

selecting a reference group of known characterised glycan compositions of
approximately similar mass to the search mass;

establishing the mean and standard deviation of each component appearing in
the reference group of the known characterised glycan compositions;

for each theoretical glycan candidate calculating a partial score for each
component in that theoretical glycan candidate based on the difference between the
observed number of the component in the theoretical glycan candidate and the mean for
that component in the reference group, divided by the standard deviation; and

combining the partial scores to provide an indication of the likelihood of that
theoretical glycan candidate occurring.

The partial scores may be combined in any suitable manner. One way, for 
example, is by multiplying the partial scores together. 

By the use of actual biological information, the present invention is able to
discern biologically likely compositions from the vast majority of compositions that
are of similar mass but differ greatly from known, biologically
extant compositions. For example, for glycans where the publicly available web tool
glycomod returns over 100 theoretical compositions for the mass 1300 Da +/- 0.5 Da, a
tool embodying the present invention returns 2 biologically likely compositions and
109 biologically unlikely compositions (which would normally be discarded).

Although the main application of the present invention is to the delineation of 
biologically likely and unlikely sugar compositions for the purposes of sugar 
structure/composition elucidation, the generic methodology of using known biological 
data as a means to refine, interpret, and/or rank theoretical or empirical data may be 
used for many other applications.

Detailed Description of a Preferred Embodiment 

The present invention is implemented on a computer means running software 
carrying out the algorithms and process of the method. 

The first input to a search using a method embodying the present invention to
determine the composition of a glycan (the "search glycan") is a search mass (which is
typically in Daltons). The search mass is typically an empirically determined mass of
the search glycan, obtained by mass spectrometry or other means, i.e. the mass of the
search glycan whose composition is to be determined.

A search mass tolerance (in Daltons) is also input. Typically this will be a
relatively small value depending on the expected accuracy of the empirically
determined search mass and may be of the order of ± 0.1 Da. Also input is a
"biological index" cut-off. The biological index is a measure of a theoretical glycan
composition's likelihood and its derivation is explained in more detail below. The
cut-off is the value of that index above which candidate compositions are discarded as
being too unlikely to occur in the real world. Also input is a "maximum composition",
which indicates the maximum allowable number of each monosaccharide in each
theoretical glycan composition. By way of example, if it were known for a fact that the
glycan to be characterised contained no sialic acid, the maximum for sialic acid would
be zero and the theoretical glycan compositions generated as potential matches for the
search mass would also exclude sialic acid. This reduces the amount of computation
required and improves speed and accuracy. In the system implementing the method,
defaults would typically be provided for those inputs, except of course for the search
mass.

Other optional parameters may also be exposed to the user to further modify the
performance of the search. The output of the composition search is a list of candidate
theoretical glycan compositions whose mass is within the search mass tolerance of the
search mass and whose biological index is less than the biological index cut-off. In
theory one of those candidates matches the composition of the search glycan.

The composition search is performed as follows:

Reference statistics for the given search mass are determined from the
(Glycosuite) database. This process is described in more detail below.

Monosaccharides are recursively recombined in varying numbers such that
every possible combination of allowed monosaccharides is created. Compositions
whose mass does not fall within the search mass tolerance are discarded, as are
compositions for which the number of any monosaccharide exceeds the maximum
number of that monosaccharide specified by the "maximum composition". The result is
a list of theoretical candidate glycan compositions.
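
The recombination step described above can be sketched as a recursive enumeration. The following Python is an illustrative sketch only, not the implementation used by the system: the function name `enumerate_compositions`, the nominal masses, and the early-exit pruning are assumptions made for this example.

```python
# Nominal residue masses (Da); illustrative values taken from the Background.
RESIDUE_MASS = {"Hex": 162.0, "HexNAc": 203.0, "dHex": 146.0,
                "Pent": 132.0, "NeuAc": 291.0}


def enumerate_compositions(search_mass, tolerance, max_composition):
    """Return every composition whose mass is within `tolerance` of `search_mass`.

    `max_composition` maps each allowed monosaccharide to its maximum count;
    a maximum of 0 excludes that monosaccharide altogether.
    """
    names = list(max_composition)
    results = []

    def recurse(index, running_mass, counts):
        # Prune: all masses are positive, so an overshoot can never recover.
        if running_mass > search_mass + tolerance:
            return
        if index == len(names):
            if abs(running_mass - search_mass) <= tolerance:
                results.append(dict(counts))
            return
        name = names[index]
        for n in range(max_composition[name] + 1):
            counts[name] = n
            recurse(index + 1, running_mass + n * RESIDUE_MASS[name], counts)
        counts.pop(name, None)

    recurse(0, 0.0, {})
    return results


limits = {"Hex": 6, "HexNAc": 6, "dHex": 3, "Pent": 3, "NeuAc": 0}
for candidate in enumerate_compositions(892.0, 0.5, limits):
    print(candidate)
```

Setting the NeuAc maximum to 0, as in the usage above, mirrors the sialic acid example: no candidate containing NeuAc is ever generated.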

The biological index of candidate compositions is determined as described
below. Compositions whose biological index does not satisfy the biological index
cut-off are discarded.

The remaining compositions are presented to the user in order of biological
index. Typically the list will be short and may only include one or two candidates.
This compares with the hundreds of candidates typically produced by Glycomod, each
of which has to be individually reviewed and assessed.
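
The discard-and-rank step can be sketched in a few lines of Python. This is a hedged illustration: the function name `filter_and_rank` and the representation of candidates as `(composition, biological_index)` pairs are assumptions, and the strict less-than comparison follows the statement that retained candidates have an index less than the cut-off.

```python
def filter_and_rank(scored_candidates, cutoff):
    """Keep candidates whose biological index is below `cutoff`,
    ordered from lowest index (most likely) to highest."""
    kept = [(comp, index) for comp, index in scored_candidates if index < cutoff]
    return sorted(kept, key=lambda pair: pair[1])


# Hypothetical scored candidates, labelled A-D for brevity.
scored = [("A", 5.0), ("B", 0.2), ("C", 1.0), ("D", 0.05)]
print(filter_and_rank(scored, cutoff=1.0))  # [('D', 0.05), ('B', 0.2)]
```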

Calculation of Biological Index
Inputs to the process are a composition and a reference data set of known sugar
compositions/structures. The reference set may be from any suitable database or data
source such as Glycosuite. The output of the process is a numerical biological index.

The determination of biological index for a given search glycan composition
proceeds as follows:

The mass of the composition is the search mass or may be determined by the
sum of the residue masses of each monosaccharide/component in the composition.

By reference to the reference set of known biological compositions, the mean
and standard deviation of the number of every monosaccharide/component in the
database within an arbitrary mass range (e.g. +/- 200 Da) of the mass of the composition
are determined. Obtaining statistics from a range of masses around the given
composition's mass is necessary in order to obtain a sufficiently large sample size
(preferably at least 100 known compositions). In the case of the Glycosuite database of
known sugar structures, a mass tolerance of 200 Da was empirically determined to be
sufficient to provide in excess of 100 known compositions for search masses up to
around 3500 Da.

By way of example, if the search mass were 1000 Da there may be 100 known
glycans in the database whose mass is between 800 and 1200 Da. The mean and
standard deviation of each monosaccharide/component appearing in those
known glycans in the database is then determined. If we take HexNAc as an example
we may find that, on average, the 100 known glycans contain 3.3 HexNAc
monosaccharides with a standard deviation of 2.3. This process is repeated to calculate
the mean and standard deviation for each monosaccharide component (Hex, dHex, Pent
et al.) and each adduct in the known glycans, if adducts are being accounted for.
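
A minimal sketch of this statistics step, assuming the reference set is available as a list of `(mass, composition)` pairs, might look as follows. The function name `reference_statistics` is hypothetical, a component absent from a known glycan is counted as zero, and the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say whether the sample or population form is used.

```python
from statistics import mean, pstdev


def reference_statistics(reference, search_mass, window=200.0):
    """Mean and standard deviation of each component's count among the
    reference compositions whose mass lies within `window` of `search_mass`.

    `reference` is a list of (mass, composition) pairs, where a composition
    maps a component name to its count.
    """
    nearby = [comp for mass, comp in reference if abs(mass - search_mass) <= window]
    components = sorted({name for comp in nearby for name in comp})
    stats = {}
    for name in components:
        # A component missing from a composition is treated as a count of 0.
        counts = [comp.get(name, 0) for comp in nearby]
        stats[name] = (mean(counts), pstdev(counts))
    return stats


# Tiny made-up reference set; a real one would hold 100 or more entries.
toy_reference = [
    (900.0, {"HexNAc": 2, "Hex": 3}),
    (1100.0, {"HexNAc": 4, "Hex": 5}),
    (2000.0, {"HexNAc": 6, "Hex": 9}),  # outside 1000 +/- 200, so ignored
]
print(reference_statistics(toy_reference, 1000.0))
```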

For each theoretical candidate glycan composition, "partial scores" are then
determined from the means and standard deviations calculated above. These are
calculated for each monosaccharide in the given composition as the absolute value of
the difference between the mean number of that monosaccharide in the reference set
and the observed number of that monosaccharide in the theoretical candidate
composition, divided by the standard deviation of that monosaccharide in compositions
from the reference set, i.e.:

$$\mathrm{partialscore}_{monosac} = \frac{\left|\mathrm{mean}_{monosac} - \mathrm{observed}_{monosac}\right|}{\mathrm{stdev}_{monosac}}$$

where mean_monosac is the mean number of the given monosaccharide in the
reference data set (Glycosuite); observed_monosac is the number of the given
monosaccharide in the theoretical candidate composition; and stdev_monosac is the
standard deviation of the given monosaccharide in the reference data set.

By way of example, if the theoretical glycan composition includes two HexNAc,
three Hex and one NeuAc, the partial score for each of those three monosaccharides is
calculated for that theoretical candidate glycan composition. Partial scores need not be
calculated for monosaccharides which do not appear in the candidate theoretical glycan
composition.

In the event that the observed_monosac equals the mean_monosac for a monosaccharide
in a particular glycan, the system is arranged to give the partial score a minimum value
of 0.01.
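
The partial score and its minimum value can be illustrated with a short Python sketch. The names `partial_score` and `MIN_PARTIAL_SCORE` are hypothetical. Two choices here are assumptions rather than statements from the text: the floor is applied with `max`, so it also lifts any score that falls below 0.01 without being exactly zero, and a zero standard deviation is rejected because the text does not say how that case is handled.

```python
MIN_PARTIAL_SCORE = 0.01  # minimum value stated in the text


def partial_score(observed, mean, stdev):
    """Number of standard deviations between `observed` and `mean`,
    floored at MIN_PARTIAL_SCORE."""
    if stdev <= 0:
        # Assumption: the text does not cover a zero standard deviation.
        raise ValueError("standard deviation must be positive")
    return max(abs(mean - observed) / stdev, MIN_PARTIAL_SCORE)


# HexNAc figures from the worked example: mean 3.3, standard deviation 2.3,
# and a candidate composition containing two HexNAc.
print(round(partial_score(2, 3.3, 2.3), 3))  # 0.565
print(partial_score(3, 3.0, 1.0))            # 0.01 (observed equals mean)
```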

Thus, the partial score of a monosaccharide is in fact the number of standard
deviations by which the number of that monosaccharide in the theoretical candidate
composition lies away from the mean. In a normal distribution, approximately 68% of all
data points lie within 1 standard deviation of the mean, approximately 95% within 2
standard deviations, and over 99% within 3. Assuming that the distributions of
monosaccharide number for the mass range used to obtain the initial means and
standard deviations for the given search mass are sufficiently close to normal, partial
scores of 3 or less for any monosaccharide indicate that the number of that
monosaccharide lies within the range covered by over 99% of all compositions of
similar mass in Glycosuite.
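
The coverage figures quoted for a normal distribution can be checked with the standard identity that the fraction of values within k standard deviations of the mean is erf(k/√2). The helper name `fraction_within` is hypothetical; the sketch assumes an exactly normal distribution, which the text itself treats only as an approximation.

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    # Roughly 0.6827, 0.9545 and 0.9973 respectively.
    print(k, round(fraction_within(k), 4))
```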

Partial scores are then combined in some manner to derive a single numeric
score, this being the biological index. The actual mathematical derivation of the
biological index may be arrived at by multiple means; different formulae exhibit
subtly different qualities in their sensitivity to large partial scores and other criteria.
For this reason, the biological index for the purposes of the present invention may be
considered merely as a numerical value that is representative of, and derived from, the
magnitudes of the differences between a given composition and a population of known
compositions of a similar mass. Presently, a biological index is calculated from partial
scores as the product of all the partial scores from the theoretical candidate
composition, i.e.:

$$\mathrm{biological\ index} = \prod_{monosac_0}^{monosac_n} \mathrm{partialscore}_{monosac}$$

The Biological Index is adept at excluding very poor matches but, at the same
time, if a candidate theoretical glycan composition has a very large (i.e. poor) partial
score for one monosaccharide but low partial scores for the other monosaccharide
components, the candidate may have an acceptably low Biological Index; hence the
system does not discard candidates which have only one poor partial score.
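
The behaviour of the product formula can be seen with a small numerical sketch. The function name `biological_index` and the partial scores below are invented for illustration; `math.prod` requires Python 3.8 or later.

```python
import math


def biological_index(partial_scores):
    """Combine partial scores by multiplying them together."""
    return math.prod(partial_scores)


# One poor partial score (4.0) offset by two very good ones.
one_poor = biological_index([4.0, 0.1, 0.2])

# Three moderately poor partial scores.
uniformly_mediocre = biological_index([1.5, 1.5, 1.5])

print(one_poor < uniformly_mediocre)  # True
```

Under these made-up numbers the candidate with a single outlying monosaccharide receives a lower (better) index than one that deviates moderately on every monosaccharide, which is the tolerance to a single poor partial score described above.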

The process of calculating the partial scores is carried out for each theoretical
glycan composition as discussed above. Compositions whose biological index does not
satisfy the biological index cut-off are discarded. The remaining compositions are
presented to the user in order of biological index. Typically the list will be short and
may only include one or two candidates. This compares with the hundreds of
candidates typically produced by Glycomod, each of which has to be individually
reviewed and assessed.

The key element of the present invention is the use of biological data as a means
to score the quality of theoretical data, in this case, sugar compositions. The actual
manner in which a biological score/index is calculated is largely arbitrary; different
formulae for calculating a biological index exhibit different characteristics with respect
to their tolerance to large compositional differences from the determined mean, and in
their propensity to extrapolate the compositions present in the reference database.

Although the present invention as described above is concerned with the use of
known sugar structures/compositions as a means to discern/elucidate monosaccharide
composition given only a mass, the concept could be extended to other compositions
and to the use of other structural characteristics, for example linkage and branching, as
reference data for determining and/or ascertaining the quality of complete sugar
structures for other investigative techniques, such as glycan fragment mass
fingerprinting (see the applicant's co-pending provisional patent application No
2003902907, the entire contents of which are incorporated herein by reference).

It will be appreciated by persons skilled in the art that numerous variations
and/or modifications may be made to the invention as shown in the specific
embodiments without departing from the spirit or scope of the invention as broadly
described. The present embodiments are, therefore, to be considered in all respects as
illustrative and not restrictive.